Abstract


The objective of this research is to develop efficient parallel algorithms for solving large sparse
linear  systems  of  equations.  In  particular,  it  looks  at  direct  solvers  for  solving  sparse  linear  systems, 
hierarchical  algorithms  for  n-body  simulations,  and  fast  and  high  quality  graph  partitioners.  As 
a  part  of  this  research,  we  have  developed  and  implemented  a  massively  parallel  formulation  of 
sparse  Cholesky  factorization.  This  implementation  delivers  up  to  20  GFLOPS  on  a  1024  processor 
Cray  T3D  even  for  medium  sized  problems.  This  is  the  highest  performance  obtained  on  any 
supercomputer  (vector  or  parallel)  for  sparse  Cholesky  factorization.  We  have  also  developed  a 
fast  and  high  quality  graph  partitioning  algorithm  that  is  roughly  two  orders  of  magnitude  faster 
than  widely  used  spectral  methods,  and  produces  better  quality  partitions.  We  have  developed 
massively  parallel  formulations  of  particle  simulation  techniques  such  as  Fast  Multipole  and  Barnes- 
Hut  methods.  We  have  applied  this  formulation  to  astrophysical  simulations  and  for  computing 
the  core  matrix-vector  product  in  dense  boundary  element  solvers. 
1  Problem  Statement


Virtually  all  scientific  and  natural  phenomena  can  be  modeled  as  systems  of  differential  equations 
that  are  solved  using  finite  element  and  finite  difference  methods.  The  objective  of  this  project  is 
to  solve  linear  systems  arising  from  these  methods.  These  sparse  linear  systems  are  too  large  to 
be  solved  cost  effectively  on  traditional  vector-supercomputers.  This  project  aims  at  developing 
highly  parallel  linear  system  solvers  and  investigating  their  applications  in  problems  of  interest 
to the US Army. This work has considerable significance since it will enable modeling accuracies and
discretizations  much  finer  than  currently  possible.  It  will  also  result  in  robust  and  portable  software 
that  can  be  used  for  a  variety  of  applications. 

2  Summary  of  Important  Results 

Direct  methods  for  solving  sparse  linear  systems  are  important  because  of  their  generality  and 
robustness.  For  linear  systems  arising  in  certain  applications,  such  as  linear  programming  and  some 
structural  engineering  applications,  they  are  the  only  feasible  methods.  Although  highly  parallel 
formulations  of  dense  matrix  factorization  are  well  known,  it  has  been  a  challenge  to  implement 
efficient  sparse  linear  system  solvers  using  direct  methods,  even  on  moderately  parallel  computers. 

We  have  recently  achieved  a  breakthrough  in  developing  a  highly  parallel  sparse  Cholesky 
factorization  algorithm  that  substantially  improves  the  state  of  the  art  in  parallel  direct  solution 
of  sparse  linear  systems — both  in  terms  of  scalability  and  overall  performance.  Experiments  have 
shown that this algorithm can easily speed up Cholesky factorization by a factor of at least a few
hundred  on  up  to  1024  processors,  and  achieve  levels  of  performance  that  were  unheard  of  and 
unimaginable  for  this  problem  until  very  recently. 

It  is  a  well  known  fact  that  dense  matrix  factorization  scales  well  and  can  be  implemented 
efficiently  on  parallel  computers.  We  have  shown  that  our  parallel  sparse  factorization  algorithm 
is  asymptotically  as  scalable  as  the  best  dense  matrix  factorization  algorithms  on  a  variety  of 
parallel  architectures  for  a  wide  class  of  problems  that  include  all  two-  and  three-dimensional 
finite  element  problems.  This  algorithm  incurs  less  communication  overhead  than  any  previously 
known parallel formulation of sparse matrix factorization and is therefore suitable for workstation
clusters, whose interconnects tend to have lower bandwidth and higher latency than those of
traditional MPP platforms. We have successfully implemented this algorithm for Cholesky
factorization  on  a  variety  of  parallel  computers,  such  as  nCUBE2,  CM-5,  IBM  SP-1  and  SP-2, 
and  the  Cray  T3D.  The  implementation  on  the  T3D  delivers  up  to  20  GFlops  on  1024  processors 
for  medium-size  structural  engineering  and  linear  programming  problems.  Although  our  current 
implementations  work  for  Cholesky  factorization,  the  algorithm  can  be  adapted  for  solving  sparse 
linear  least  squares  problems  and  for  Gaussian  elimination  of  diagonally  dominant  matrices  that 
are  almost  symmetric  in  structure. 

Fast and accurate graph partitioning algorithms are needed for the solution of sparse systems of
linear  equations  Ax  =  b  on  a  parallel  computer.  In  the  case  of  direct  solvers,  a  graph  partitioning 
algorithm can be used to reorder the matrix so that the amount of fill is minimized, and the concurrency
that can be exploited during parallel factorization is maximized. In the case of parallel
iterative  solvers,  the  graph  corresponding  to  matrix  A  needs  to  be  partitioned  into  p  parts  so  that 
the  number  of  edges  with  vertices  on  different  partitions  is  minimized.  Many  heuristic  algorithms 
are  known  for  finding  good  partitions  of  a  graph.  Algorithms  that  provide  good  partitions  of  the 
graph (e.g., spectral methods) tend to be very slow, especially for large graphs. Faster algorithms
tend to compromise on the quality of the partition. In the context of direct methods, good sequential
partitioning methods can take even more time than the factorization step running on a
parallel computer, and cheaper methods result in a high degree of fill in the matrix, causing overall
factorization time to increase by a large factor.
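
The partitioning objective for iterative solvers, the number of edges whose endpoints fall in different parts, can be made concrete with a small sketch. The grid graph, the two candidate partitions, and the function name `edge_cut` are illustrative assumptions and are not taken from the software described in this report.

```python
def edge_cut(edges, part):
    """Count the edges whose two endpoints are assigned to different parts."""
    return sum(1 for u, v in edges if part[u] != part[v])


# A 2 x 4 grid graph with vertices numbered row by row:
#   0 - 1 - 2 - 3
#   |   |   |   |
#   4 - 5 - 6 - 7
edges = [(0, 1), (1, 2), (2, 3), (4, 5), (5, 6), (6, 7),
         (0, 4), (1, 5), (2, 6), (3, 7)]

# Two balanced 2-way partitions (vertex -> part id).
left_right = {0: 0, 1: 0, 4: 0, 5: 0, 2: 1, 3: 1, 6: 1, 7: 1}
top_bottom = {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1, 7: 1}

cut_lr = edge_cut(edges, left_right)   # cuts the two horizontal middle edges
cut_tb = edge_cut(edges, top_bottom)   # cuts all four vertical edges
```

Both partitions are perfectly balanced, yet the left/right split cuts half as many edges as the top/bottom split. A partitioner searches for splits of the first kind.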

We  have  recently  developed  a  multilevel  graph  partitioning  scheme  that  consistently  outperforms 
the  spectral  partitioning  schemes  in  terms  of  cut  size  and  is  substantially  faster.  We  also  used  our 
graph  partitioning  scheme  to  compute  fill  reducing  orderings  for  sparse  matrices.  Surprisingly, 
our  scheme  substantially  outperforms  the  multiple  minimum  degree  algorithm  (MMD),  which  is 
the  most  commonly  used  method  for  computing  fill  reducing  orderings  of  a  sparse  matrix.  The 
edge-cut produced by our multilevel scheme is significantly better than that produced by the multilevel
spectral bisection (MSB) scheme, and our algorithm is 20 times faster than MSB on average. Furthermore, our multilevel
scheme  does  consistently  better  as  the  size  of  the  matrices  increases  and  as  the  matrices  become 
more  unstructured. 
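
To illustrate why the ordering matters for fill, the following sketch performs symbolic elimination on a graph and counts the fill edges that a given elimination order creates. It is neither the multilevel scheme nor MMD; the function name `fill_in` and the star-graph example are assumptions chosen only for illustration.

```python
def fill_in(n, edges, order):
    """Count fill edges created by symbolic elimination in the given order.

    Eliminating a vertex makes all of its not-yet-eliminated neighbours
    pairwise adjacent; every edge added this way is one fill entry in the
    Cholesky factor.
    """
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    fill = 0
    for v in order:
        nbrs = sorted(adj[v])
        for i, a in enumerate(nbrs):
            for b in nbrs[i + 1:]:
                if b not in adj[a]:
                    adj[a].add(b)
                    adj[b].add(a)
                    fill += 1
        # Remove the eliminated vertex from the remaining graph.
        for a in nbrs:
            adj[a].discard(v)
        del adj[v]
    return fill


# Star graph (an "arrow" matrix): hub 0 is connected to vertices 1..4.
star = [(0, 1), (0, 2), (0, 3), (0, 4)]

hub_first = fill_in(5, star, [0, 1, 2, 3, 4])  # hub eliminated first
hub_last = fill_in(5, star, [1, 2, 3, 4, 0])   # hub eliminated last
```

Eliminating the hub first turns the four leaves into a clique, while eliminating it last creates no fill at all. The matrix is the same in both cases and only the ordering differs.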

Particle  simulations  find  extensive  applications  in  various  engineering  and  scientific  domains. 
Important applications of this problem are in astrophysical simulations, fluid dynamics, design of
composites, and protein synthesis. Exact simulation of the behavior of n particles requires the
computation of all pairwise forces, O(n^2) of them, during each time-step. Given the large number
of particles involved, this represents a computationally prohibitive task. Hierarchical methods reduce this complexity to
O(n) or O(n log n) by aggregating the effect of spatially proximate particles into a single expression.
However,  for  highly  irregular  distributions,  these  methods  are  difficult  to  parallelize.  We  have 
developed  a  highly  scalable  parallel  formulation  of  the  Barnes-Hut  method  for  n-body  simulations. 
We  have  used  this  formulation  for  performing  astrophysical  simulations  and  demonstrated  the 
excellent raw and parallel performance of our schemes on a 256 processor nCUBE2 and a CM-5. We
have  also  studied  its  performance  and  error  properties  in  the  context  of  computing  matrix-vector 
products  for  dense  iterative  solvers. 
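
The aggregation idea behind hierarchical methods can be sketched with a minimal sequential, two-dimensional Barnes-Hut evaluation. This is not the parallel formulation described above. The class name `Cell`, the opening parameter `theta`, the softening length `eps`, and the random test distribution are all assumptions made for illustration, and the sketch assumes no two bodies share the same position.

```python
import math
import random


class Cell:
    """Square quadtree cell that accumulates the mass of the bodies inside it."""

    def __init__(self, cx, cy, half):
        self.cx, self.cy, self.half = cx, cy, half  # centre and half-width
        self.mass = 0.0
        self.mx = self.my = 0.0   # mass-weighted coordinate sums
        self.body = None          # the single body held by a leaf
        self.kids = None          # four sub-cells once the cell is split

    def insert(self, x, y, m):
        if self.kids is None and self.body is None:
            self.body = (x, y, m)             # empty leaf: store the body
        else:
            if self.kids is None:             # occupied leaf: split it
                h = self.half / 2
                self.kids = [Cell(self.cx + sx * h, self.cy + sy * h, h)
                             for sy in (-1, 1) for sx in (-1, 1)]
                old, self.body = self.body, None
                self._kid(old[0], old[1]).insert(*old)
            self._kid(x, y).insert(x, y, m)
        self.mass += m
        self.mx += m * x
        self.my += m * y

    def _kid(self, x, y):
        return self.kids[(x >= self.cx) + 2 * (y >= self.cy)]


def accel(cell, x, y, theta, eps=1e-3):
    """Approximate acceleration at (x, y); theta = 0 opens every cell."""
    if cell.mass == 0.0:
        return 0.0, 0.0
    dx = cell.mx / cell.mass - x
    dy = cell.my / cell.mass - y
    d2 = dx * dx + dy * dy + eps * eps        # softened squared distance
    d = math.sqrt(d2)
    # Use the aggregate if the cell is a leaf or looks small from (x, y).
    if cell.kids is None or 2 * cell.half < theta * d:
        f = cell.mass / (d2 * d)
        return f * dx, f * dy
    ax = ay = 0.0
    for kid in cell.kids:
        kx, ky = accel(kid, x, y, theta, eps)
        ax += kx
        ay += ky
    return ax, ay


def direct(bodies, x, y, eps=1e-3):
    """Reference O(n) sum per evaluation point, O(n^2) over all bodies."""
    ax = ay = 0.0
    for bx, by, m in bodies:
        dx, dy = bx - x, by - y
        d2 = dx * dx + dy * dy + eps * eps
        f = m / (d2 * math.sqrt(d2))
        ax += f * dx
        ay += f * dy
    return ax, ay


random.seed(0)
n = 200
bodies = [(random.random(), random.random(), 1.0 / n) for _ in range(n)]
root = Cell(0.5, 0.5, 0.5)                    # unit square
for b in bodies:
    root.insert(*b)


def rel_error(theta):
    """Aggregate relative error of the tree code against direct summation."""
    num = den = 0.0
    for x, y, _ in bodies:
        ex, ey = direct(bodies, x, y)
        ax, ay = accel(root, x, y, theta)
        num += (ax - ex) ** 2 + (ay - ey) ** 2
        den += ex ** 2 + ey ** 2
    return math.sqrt(num / den)
```

Calling `rel_error(0.0)` forces every cell to be opened and reproduces the direct sum up to round-off. A larger `theta` replaces distant groups of bodies by their centre of mass, trading accuracy for fewer interactions.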